library(RColorBrewer)
library(mvtnorm)
library(latex2exp)
library(car)
cols <- brewer.pal(9, "Set1")

Principal Component Analysis

Toy example

What is the best representation of the data in 0 dimensions? Of course it is the sample mean: it minimizes the residual sum of squares (the sum of squared differences between the points and the mean).

n <- 100
mu <- c(1,2)
sig <- matrix(c(1,1,1,4), 2, 2, byrow = T)

X <- rmvnorm(n, mu, sig)

par(mar=c(4,4,2,2), family = 'serif')
plot(X, asp = 1, pch = 16, xlab = TeX("$X_1$"), ylab = TeX("$X_2$"))
segments(x0 = X[,1], y0 = X[,2], x1 = colMeans(X)[1], y1 = colMeans(X)[2])
points(colMeans(X)[1], colMeans(X)[2], col = cols[1], pch = 17)
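
As a quick numerical check of this claim (a small sketch using the simulated X above), we can compare the residual sum of squares at the sample mean with the value at a perturbed point: any other candidate gives a larger value.

# Sum of squared distances from the points to a candidate centre m
rss <- function(m) sum(rowSums(sweep(X, 2, m)^2))
rss(colMeans(X))               # minimised at the sample mean
rss(colMeans(X) + c(0.5, 0))   # any other point gives a larger value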

What is the best representation of the data in 1 dimension? It is the line that captures most of the variability of the data!

M <- colMeans(X)
S <- cov(X)

par(mar=c(4,4,2,2), family = 'serif')
plot(X, asp = 1, pch = 16, xlab = TeX("$X_1$"), ylab = TeX("$X_2$"))
points(colMeans(X)[1], colMeans(X)[2], col = cols[1], pch = 17)
ellipse(M, S, 2.5, add = T, col = cols[1], lty = 2)

# Compute the principal components
PC <- prcomp(X, scale = F)
PC1 <- NULL
for(i in 1:n){
  # The loadings are the directions that define the PC
  PC1 <- rbind(PC1, PC$rotation[,1]*PC$x[i,1] + M)
}

par(mar=c(4,4,2,2), family = 'serif')
plot(X, asp = 1, pch = 16, xlab = TeX("$X_1$"), ylab = TeX("$X_2$"))
points(PC1, col = cols[1], pch = 16, cex = 0.75)
segments(x0 = X[,1], y0 = X[,2], x1 = PC1[,1], y1 = PC1[,2], col = cols[1])
points(X, pch = 16)

par(mar=c(4,4,2,2), family = 'serif')
plot(X, asp = 1, pch = 16, xlab = TeX("$X_1$"), ylab = TeX("$X_2$"))
arrows(x0 = c(0,0), y0 = c(0,0), x1 = 5*PC$rotation[1,], y1 = 5*PC$rotation[2,], col = 'red', length = 0.1)
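
As a sanity check, the loadings returned by prcomp should coincide, up to sign flips, with the eigenvectors of the sample covariance matrix S, and the PC variances with its eigenvalues:

# Loadings = eigenvectors of S (up to sign); PC variances = eigenvalues of S
eig <- eigen(S)
eig$vectors
PC$rotation
eig$values
PC$sdev^2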

Real data

Let us first load data on food consumption for all European countries.

rm(list=ls())
cols <- brewer.pal(9, "Set1")

food <- read.table('Food.txt', header = T)
scale_food <- as.matrix(scale(food))
p <- ncol(food)

plot(food, pch = 16) 

Each point represents a country. It is hard to visualize such high-dimensional data and to grasp the relationships between variables visually.

For that, we can use PCA.

pc_fit <- prcomp(food, scale = T)
scores_fit <- pc_fit$x
load_fit <- pc_fit$rotation

The object pc_fit$x is an \(n \times p\) matrix containing the scores of the data, whereas pc_fit$rotation is the rotation matrix, a \(p \times p\) matrix with the PC loadings (directions) in its columns.
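
As a quick check, the scores are simply the standardized data multiplied by the rotation matrix (up to numerical precision):

# Scores = (standardized data) %*% (loadings)
max(abs(scale_food %*% load_fit - scores_fit))  # numerically zero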

Let’s show the proportion of variance explained by each PC.

layout(matrix(c(2,3,1,3), 2, byrow = T))
barplot(pc_fit$sdev^2, las = 2, main = 'Principal Components', ylim = c(0,3), 
        ylab = 'Variances')
plot(cumsum(pc_fit$sdev^2)/sum(pc_fit$sdev^2), type = 'b', axes = F, 
     xlab = 'n° components', ylab = '% explained variance', ylim = c(0,1))
abline(h = 1, col = 'blue')
abline(h = 0.8, lty = 2, col = 'blue')
box()
axis(2, at = 0:10/10, labels = 0:10/10)
axis(1, at = 1:p, labels = 1:p, las = 2)
barplot(apply(scale_food, 2, sd)^2, las = 2, main = 'Original Variables', 
        ylab = 'Variances')

We can see that the variances of the 5th, 6th, 7th and 8th PCs are very small: they are not very useful to explain the data.
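
The same information can be read off numerically, for instance with summary():

# Proportion of variance explained by each PC and its cumulative sum
summary(pc_fit)
cumsum(pc_fit$sdev^2)/sum(pc_fit$sdev^2)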

We can now show the loading plot, which will help us interpret the PCs.

par(mar = c(2,2,2,0), mfrow = c(4,1))
for(i in 1:4){
  barplot(load_fit[,i], ylim = c(-1, 1))
  abline(h = 0)
}

Remember that these are the directions of the PCs. The first PC, which explains about \(50\%\) of the variability, seems to highlight a contrast (some loadings are positive, some negative) between the consumption of eggs, milk, meat and pork meat on one side and cereals and pulses on the other. It is a contrast between livestock and agricultural goods.

Fruit and fish are left out of this contrast: they contribute very little to the first PC and are more important in the second, where there seems to be a contrast between fish and fruit on one side and the other variables on the other.
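
To double-check this reading of the barplots, we can print the first two loading vectors, rounded for readability:

# Signs and magnitudes of the loadings on the first two PCs
round(load_fit[, 1:2], 2)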

Let us now visualize the score plot. It is a plot of the data in the space of the first two PCs (in this case). We can also add the original axes to see how the PCs are related to them.

par(mar=c(4,4,2,2), family = 'serif')
biplot(pc_fit, scale = 0, cex = 0.8)

The interpretation is clear: Mediterranean countries consume pulses, fruit and fish; Eastern European countries consume cereals. The other countries consume mostly meat and dairy products.
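
Since the first two PCs explain most of the variability, the standardized data are well approximated by the rank-2 reconstruction built from the first two scores and loadings (a short sketch using the objects defined above):

# Rank-2 approximation of the standardized data from the first two PCs
food_rank2 <- scores_fit[, 1:2] %*% t(load_fit[, 1:2])
max(abs(scale_food - food_rank2))  # worst-case residual of the approximation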

Clustering

Hierarchical clustering

Remember the iris dataset? Let’s pretend for a second that we don’t know the species (label coloring).

rm(list=ls())
cols <- brewer.pal(9, "Set1")

data <- as.matrix(iris[sample(1:nrow(iris), nrow(iris)),1:4])

par(mar=c(4,4,2,2), family = 'serif')
pairs(data, pch = 16)

How many groups do you see?

iris_dist <- dist(data, method='euclidean')

iris_sin <- hclust(iris_dist, method = 'single')
iris_ave <- hclust(iris_dist, method = 'average')
iris_com <- hclust(iris_dist, method = 'complete')

par(mar=c(4,4,2,2), mfrow=c(1,3), family = 'serif')
plot(iris_sin, main='euclidean-single', hang=-0.1, xlab='', labels=F, cex=0.6, sub='')
plot(iris_ave, main='euclidean-average', hang=-0.1, xlab='', labels=F, cex=0.6, sub='')
plot(iris_com, main='euclidean-complete', hang=-0.1, xlab='', labels=F, cex=0.6, sub='')

Dendrograms are a tree representation of the nested clustering structure. Long vertical branches indicate that the corresponding clusters are robust.

labels_ave <- cutree(iris_ave, k = 3)
labels_ave
104 145   5   4 120  33  77  75  34  28 117 114 147  43  60  71  84 146 132  47 
  1   1   2   2   3   2   3   3   2   2   1   3   3   2   3   3   3   1   1   2 
133  88   7  96 112  36 103   3  39  15  99 130 100   6 142 127  86 107  70  19 
  1   3   2   3   1   2   1   2   2   2   3   1   3   2   1   3   3   3   3   2 
124 102  26 128  76  49  12 138  14  40  30  85  27  41  18 106  24  50  46 143 
  3   3   2   3   3   2   2   1   2   2   2   3   2   2   2   1   2   2   2   3 
113 129  95  57  56 136 122  98  94 118  48 123  42 149   8  69  83  92  23  11 
  1   1   3   3   3   1   3   3   3   1   2   1   2   1   2   3   3   3   2   2 
 45 116  29  44  87 115  55  82  53  91  64 110 111  93  10  25  32 121 139 105 
  2   1   2   2   3   3   3   3   3   3   3   1   1   3   2   2   2   1   3   1 
 17 140 126   9  74  65 135  59  81  38 150  73 137  52  51 109   1  16  67  20 
  2   1   1   2   3   3   1   3   3   2   3   3   1   3   3   1   2   2   3   2 
 63  78 141  61  22  37   2  21 101  68  35 108  58  90 131  31  72 119 148  79 
  3   3   1   3   2   2   2   2   1   3   2   1   3   3   1   2   3   1   1   3 
 66  97  13 134  80 144  62  54  89 125 
  3   3   2   3   3   1   3   3   3   1 
par(mar=c(4,4,2,2), family = 'serif')
pairs(data, pch = 16, col = cols[labels_ave])
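
Since in reality we do know the species, we can check how well the three clusters recover them; the row names of the shuffled matrix still refer to the original rows of iris:

# Cross-tabulate the hierarchical cluster labels with the true species
true_species <- iris$Species[as.numeric(rownames(data))]
table(labels_ave, true_species)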

\(k\)-means

Let us try k-means with \(k = 3\).

set.seed(1234)
fit <- kmeans(data, centers = 3)


par(mar=c(4,4,2,2), family = 'serif')
pairs(data, pch = 16, col = cols[fit$cluster])
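
We can also cross-tabulate the k-means partition against the hierarchical one obtained above (cluster labels are arbitrary, so only the pattern of agreement matters):

# Agreement between k-means and average-linkage hierarchical clustering
table(kmeans = fit$cluster, hierarchical = labels_ave)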

A good heuristic in practice to find a good value of \(k\) is to look at when the total within-cluster sum of squared distances (the within-cluster variability) stops decreasing drastically, i.e. the "elbow" of the curve.

K <- 20
WCV <- numeric(K)
for (k in 1:K){
  # total within-cluster sum of squares for each k
  # (nstart > 1 makes k-means less sensitive to the random initialization)
  fit <- kmeans(data, centers = k, nstart = 20)
  WCV[k] <- sum(fit$withinss)
}
plot(1:K, WCV, pch = 16, xlab = 'k', ylab = 'within-cluster sum of squares')